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ABSTRACT 

This paper addresses the problem of summarizing the posterior 
distributions that typically arise, in a Bayesian framework, when 
dealing with signal decomposition problems with unknown number 
of components. Such posterior distributions are defined over union 
of subspaces of differing dimensionality and can be sampled from 
using modern Monte Carlo techniques, for instance the increasingly 
popular RJ-MCMC method. No generic approach is available, how- 
ever, to summarize the resulting variable-dimensional samples and 
extract from them component-specific parameters. 

We propose a novel approach to this problem, which consists in 
approximating the complex posterior of interest by a "simple" — but 
still variable-dimensional — parametric distribution. The distance 
between the two distributions is measured using the Kullback- 
Leibler divergence, and a Stochastic EM-type algorithm, driven by 
the RJ-MCMC sampler, is proposed to estimate the parameters. The 
proposed algorithm is illustrated on the fundamental signal process- 
ing example of joint detection and estimation of sinusoids in white 
Gaussian noise. 

Index Terms — Bayesian inference; Posterior summarization; 
Trans-dimensional MCMC; Label-switching; Stochastic EM. 

1. INTRODUCTION 

Nowadays, owing to the advent of Markov Chain Monte Carlo 
(MCMC) sampling methods 1 1], Bayesian data analysis is consid- 
ered as a conventional approach in machine learning, signal and 
image processing, and data mining problems — to name but a few. 
Nevertheless, in many applications, practical challenges remain in 
the process of extracting, from the generated samples, quantities of 
interest to summarize the posterior distribution. 

Summarization consists, loosely speaking, in providing a few 
simple yet interpretable parameters and/or graphics to the end-user 
of a statistical method. For instance, in the case of a scalar parame- 
ter with a unimodal posterior distribution, measures of location and 
dispersion (e.g., the empirical mean and the standard deviation, or 
the median and the interquartile range) are typically provided in ad- 
dition to a graphical summary of the distribution (e.g., a histogram 
or a kernel density estimate). In the case of multimodal distribu- 
tions summarization becomes more difficult but can be carried out 
using, for instance, the approximation of the posterior by a Gaussian 
Mixture Model (GMMs) |2]. 

This paper addresses the problem of summarizing posterior dis- 
tributions in the case of trans-dimensional problems, i.e. "the prob- 
lems in which the number of things that we don't know is one of the 
things that we don't know" The problem of signal decomposi- 



tion when the number of components is unknown is an important ex- 
ample of such problems. Let y — {yi , 2/2 , . • . , 2/iv )* be a vector of 
A'^ observations, where the superscript t stands for vector transposi- 
tion. In signal decomposition problems, the model space is a finite or 
countable set of models, A4 = {Mk, k £ /C}, where /C C N is an 
index set. It is assumed here that, under A^fe, there are k components 
with component-specific parameters Okj G C R''. We denote 
by 0k ~ {Ok,i, ■ • ■ , Ok.k) £ Q** the vector of component-specific 
parameters. In a Bayesian framework, a joint posterior distribution 
is obtained through Bayes' formula for the model index k and the 
vector of component-specific parameters, after assigning prior dis- 
tributions on them : 

/(fc, 9k) cc p{y\k, Ok)p{ek\k)p{k), 

where oc indicates proportionality. This joint posterior distribution, 
defined over a union of subspaces of differing dimensionality, com- 
pletely describes the information (and the associated uncertainty) 
provided by the data y about the candidate models and the vector 
of unknown parameters. 

I.I. Illustrative example: sinusoid detection 

In this example, it is assumed that under Mk, y can be written 
as a linear combination of k sinusoids observed in white Gaus- 
sian noise. The unknown component-specific parameters are 
Ok = {sLk,'^k,4'k}, where a^, a;/., and <f)k are the vectors of 
amplitudes, radial frequencies and phases, respectively. We use 
the hierarchical model, prior distributions, and Reversible Jump 
MCMC (RJ-MCMC) sampler |3] proposed in |4] for this problem; 
the interested reader is thus referred to |3, 4] for more details. 

Figure [T] represents the posterior distributions of both the num- 
ber of components k and the sortec{] radial frequencies uifc given k 
obtained using the RJ-MCMC sampler. Each row is dedicated to one 
value of fc, for 2 < fc < 4; observe that, other models have negligi- 
ble posterior probabilities. In the experiment, the observed signal of 
length A'^ = 64 consists of three sinusoids with amplitudes s.k = 
(20,6.32,20)* and radial frequencies LJk = (0.63,0.68,0.73)*. 



The SNR : 



is set to the moderate value of 7dB, where 



Dfc is the design matrix and a'^ is the noise variance. 

Roughly speaking, two approaches co-exist in the literature for 
such situations: Bayesian Model Selection (BMS) and Bayesian 
Model Averaging (BMA). The BMS approach ranks models ac- 
cording to their posterior probabilities p(fc|y), selects one model. 



' Owing to the invaiiance of both the likeHhood and the prior under per- 
mutation of the components, component-specific marginal posteriors are all 
equal: this is the "label-switching" phenomenon |5, 6, 7]. Identifiability con- 
straints (such as sorting) are the simplest way of dealing with this issue. 




Fig. 1: Posteriors of k (left) and sorted radial frequencies, oj^, given 
k (right). The true number of components is three. The vertical 
dashed lines in the right figure locate the true radial frequencies. 



and then summarizes the posterior under the (fixed-dimensional) 
selected model. This is at the price of loosing valuable information 
provided by the other (discarded) models. For instance, in the ex- 
ample of Figure [T] all information about the small — and therefore 
harder to detect — middle component is lost by selecting the most 
a posteriori probable model M2- The BMA approach consists in 
reporting results that are averaged over all possible models; it is, 
therefore, not appropriate for studying component-specific parame- 
ters, the number of which changes in each modefl 

More information concerning these two approaches can be 
found in [3] and references therein. To the best of our knowledge, 
no generic method is currently available, that would allow to sum- 
marize the information that is so easily read on Figure [T] for this 
very simple example: namely, that there seem to be three sinusoidal 
components in the observed noisy signal, the middle one having a 
smaller probability of presence than the others. 

1.2. Outline of the paper 

In this paper, we propose a novel approach to summarize the poste- 
rior distributions over variable-dimensional subspaces that typically 
arise in signal decomposition problems with an unknown number 
of components. It consists in approximating the complex posterior 
distribution with a parametric model (of varying-dimensionality), 
by minimization of the Kullback-Leibler (KL) divergence between 
the two distributions. A Stochastic EM (SEM)-type algorithm |8'], 
driven by the output of an RJ-MCMC sampler, is used to estimate 
the parameters of the approximate model. 

Our approach shares some similarities with the relabeling algo- 
rithms proposed in 1 6, 7] to solve the "label switching" problem, and 
also with the EM algorithm used in |9] in the context of adaptive 
MCMC algorithms (both in a^xed-dimensional setting). The main 
contribution of this paper is the introduction of an original variable- 
dimensional parametric model, which allows to tackle directly the 
difficult problem of approximating a distribution defined over a 
union of subspaces of differing dimensionality — and thus provides 
a first solution to the "trans-dimensional label-switching" problem. 



so to speak. 

The paper is organized as follows. Section|2]introduces the pro- 
posed model and stochastic algorithm. Section [3] illustrates the ap- 
proach using the sinusoid detection example already discussed in the 
introduction. Finally, Section|4]concludes the paper and gives direc- 
tions for future work. 



2. PROPOSED ALGORITHM 



Let F denote the target posterior distribution, defined on the 



variable-dimensional space 



Ufc™o{'^} X ® • We assume 



that F admits a probability density function (pdf) /, with respect to 
the fcd-dimensional Lebesgue measure on each {k} x@^,k e K.. 
To keep things simple, we also assume that = R''. 

Our objective is to approximate the exact posterior density / 
using a "simple" parametric model q-r), where rj is the vector 
of parameters defining the model. The pdf will also be de- 
fined on the variable-dimensional space X (i.e., it is not a fixed- 
dimensional approximation as in the BMS approach). We assume 
that a Monte Carlo sampling method is available, e.g. a RJ-MCMC 
sampler |3], to generate M samples from /, which we denote 
by x'^' = (fc(^',0['(';)),for j = 1, . . . , M. 

2.1. Variable-dimensional parametric model 

Let us describe the proposed parametric model from a generative 
point of view. As in a traditional GMM, we assume that there is a 
certain number L of "Gaussian components" in the (approximate) 
posterior, each generating a d-variate Gaussian vector with mean /x; 
and covariance matrix Sj, 1 < / < L. 

An X-valued random variable x = (k, 9k), with < < L, 
is generated as follows. First, each of the L components can be ei- 
ther present or absent according to a binary indicator variable £ 
{0, 1}. These Bernoulli variables are assumed to be independent, 
and we denote by -ki G (0; 1] the "probability of presence" of the 
Gaussian component. Second, given the indicator variables, k — 
X^i^Li Gaussian vectors are generated by the Gaussian compo- 
nents that are present (^j = 1) and randomly arranged in a vector 

0k ~ {dk,l, ■ ■ ■ , 0k,k)- 

We denote by the pdf of the random variable x that is thus 
generated, with rji = (/xj, S;, ttj) the vector of parameters of the 
Gaussian component, 1 < I < L, and tj — (t/i, . . . , tjl). 

Remark. In contrast with GMMs, where only one component is 
present at a time (i.e., = 1 in our notations), there is no constraint 
here on the sum of the probabilities of presence. 

2.2. Estimating the model parameters 

One way to fit the parametric distribution §77 (x) to the poste- 
rior /(x) is to minimize the KL divergence of / from qr,, denoted by 
£'i<-L(/(x)||gr((x)). Thus, we define the criterion to be minimized 
as 



J{ri) ^ DA-L(/(x)||g^(x)) = / / (x) 



logi^dx. 
(7„(x) 



Using samples generated by the RJ-MCMC sampler, this criterion 
can be approximated as 



"See, however, the intensity plot provided in Section [3] (middle plot on 
Figure|4) as an example of a BMA summary related to a component-specific 
parameter. 



Jirj) ~ Jirj) 



--^log (g,(x«))+C. 



At the r iteration, 

S-step draw allocation vectors z*^*''"' ~ p ^ • | x'*', 17^ 

fori = 1, . . . , M. 
M-step estimate i)^^^ such that 

M 

f)^*^' — argmax^ log p ( x^'-* , z^''' ^ | rj 



Fig. 2: SEM algorithm. 

where C is a constant that does not depend on rj. One should note 
that minimizing J{rj) amounts to estimating tj such that 

M 

i) = argmax^^log (^g^(x''')^ . (1) 

j=i 

Now, we assume that each element of the i* observed sam- 
ple x^*', for j = 1, . . . , fc', has arisen from one of the L Gaussian 
components contained in q-r,. At this point, it is natural to introduce 
allocation vectors z'*' corresponding to the observed sample x'*', 
for i — 1, . . . , M, as latent variables. The element z^'' = I indi- 
cates that x^*' is allocated to the Z* Gaussian component. 

Hence, given the allocation vector z'*' and the parameters of the 
model ?7, the conditional distribution of the observed samples, i.e., 
the model's likelihood, is 

It turns out that the EM-type algorithms, which have been used 
in similar works 0,0, @]> are not appropriate for solving this prob- 
lem, as computing the expectation in the E-step is intricate. More 
explicitly, in our problem the computational burden of the summa- 
tion in the E-step over the set of all possible allocation vectors z 
increases very rapidly with k. In fact, even for moderate values 
of k, say, k — 10, the summation is far too expensive to compute 
as it involves k\ ~ 3.6 lO'^ terms. In this paper, we propose to 
use SEM 1 8], a variation of the EM algorithm in which the E-step 
is substituted with stochastic simulation of the latent variables from 
their conditional posterior distributions given the previous estimates 
of the unknown parameters. In other words, for i — 1, . . . , M, the 
allocation vectors z'*' are drawn from p{- \ x''^ V^^^)- This step is 
called the Stochastic (S)-step. Then, these random samples are used 
to construct the so-called pseudo-completed likelihood which reads 

p(x«,z«|r,) = nAr(x«|M,(0,S.)) 

where Z is the set of all allocation vectors and = 1 if and only 
if there is a j G {1, . . . , fe'*'} such that z^'' = I. The proposed 
SEM- type algorithm for our problem is described in Figure [2] 

Direct sampling from p( • Ix'*', yy^"^'), as required by the S-step, 
is unfortunately not feasible. Instead, since 

p(z(') |x«, r)*")) oc p(x''),z« iTy^")) 



can be computed up to a normalizing constant, we devised an 
Independent Metropolis-Hasting (I-MH) algorithm to construct a 
Markov chain withp(z'-'' | x'*', 17''^-') as its stationary distribution. 

2.3. Robustified algorithm 

Preliminary experiments with the model and method described in the 
previous sections proved to be disappointing. To understand why, it 
must be remembered that the pdf Qt, we are looking for is only an 
approximation (hopefully a good one) of the true posterior /. For in- 
stance, for high values of k, the posterior typically involves a diffuse 
part which can not properly represented by the parametric model 
(this can be seen quite clearly for fc = 4 on Figure [TJ. Therefore, 
for any r\, some samples generated by the RJ-MCMC sampler are 
outliers with respect to g^, (i.e., the true posterior can be considered 
as a contaminated version of q^^) which causes problems when using 
a maximum likelihood-type estimate such as (TJ. 

These robustness issues were solved, in this paper, using two 
modifications of the algorithm (only in the one-dimensional case up 
to now). First, robust estimates 1 10] of the means and variances of 
a Gaussian distribution, based on the median and the interquartile 
range, are used instead of the empirical means and variances in the 
M-step. Second, a Poisson process component (with uniform inten- 
sity) is added to the model, in order to account for the diffuse part 
of the posterior and allow for a number L of Gaussian components 
which is smaller than the maximum observed A:'''. 

Remark. Similar robustness concerns are widespread in the cluster- 
ing literature; see, e.g., fill] and the references therein. 

3. RESULTS 

In this section, we will investigate the capability of the proposed 
algorithm for summarizing variable-dimensional posterior distribu- 
tions. We emphasize again that the output of the trans-dimensional 
Monte Carlo sampler, e.g. RJ-MCMC in this paper, is considered as 
the observed data for our algorithm. Regarding the fact that in this 
paper we provide results for the sinusoids' radial frequencies, the 
proposed parametric model consists of univariate Gaussian compo- 
nents. In other words, the space of component-specific parameters 
© = (0; tt) C R. But we believe that our algorithm is not limited to 
the problems with one-dimensional component-specific parameters. 
Therefore, in this section, it is assumed that each Gaussian compo- 
nent has a mean p,, a variance s^, and a probability of presence tt to 
be estimated. 

Before launching the algorithm, first, we need to initialize the 
parametric model. It is natural to deduce the number L of Gaussian 
components from the posterior distribution of k. Here, we set it 
to the 90"^ percentile to keep all the probable models in the play. 
To initialize the Gaussian components' parameters, i.e. p and s^, 
we used the robust estimates of the posterior of the sorted radial 
frequencies given k = L. 

We ran the "robustified" stochastic algorithm introduced in Sec- 
tion [2] on the specific example shown in Figure [T] for 50 iterations, 
with L = 3 Gaussian components (the posterior probability of {k < 
3} is approximately 90.3%). Figure |3] illustrates the evolution of 
model parameters rj together with the criterion J . Two substan- 
tial facts can be deduced from this figure; first, the increasing be- 
havior of the criterion J, which is almost constant after the 10*'' 
iteration. Second, the convergence of the parameters of parametric 
model, esp. means and probabilities of presence tt, though using a 
naive initialization procedure. Indeed after the 40*'' iteration there is 
no significant move in the parameter estimates. Table[T]presents the 
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Fig. 3: Performance of the proposed summarizing algorithm on the 
sinusoid detection example. There are three Gaussian components 
in the model. 
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Fig. 4: The pdf of fitted Gaussian components (top), the histogram 
intensity of all radial frequencies samples (middle), and the his- 
togram intensity of the allocated samples to the Poisson point pro- 
cess component (bottom). 
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Table 1: Summaries of the variable-dimensional posterior distribu- 
tion shown in Figure[TJ The proposed approach vs. BMS. 



summaries provided by the proposed method along with the ones 
obtained using the BMS approach. Contrary to BMS, the method 
that we proposed has enabled us to benefit from the information of 
all probable models to give summaries about the middle harder to 
detect component. Turning to the results of our approach, it can be 
seen that the estimated means are compatible with the true radial 
frequencies. Furthermore, the estimated probabilities of presence 
are consistent with uncertainty of them in the variable-dimensional 
posterior shown in Figure [T] Note the small estimated standard de- 
viations which indicate our robustifying strategies have been useful. 

The pdf's of the estimated Gaussian components are shown in 
Figure |4] (top). Comparing with the posterior of sorted radial fre- 
quencies shown in Figure [T] it can be inferred that the proposed al- 
gorithm has managed to remove the label-switching phenomenon in 
a variable-dimensional problem. Furthermore, the intensity plot of 
the allocated samples to the point process component is depicted in 
Figure|4](bottom). This presents the outliers in the observed samples 
which cannot be be described by the Gaussian components. Note 
that without the point process component these outliers would be al- 
located to the Gaussian components which can, consequently, yield 
in a significant deterioration of parameter estimates. 

4. CONCLUSION 

In this paper, we have proposed a novel algorithm to summarize pos- 
terior distributions defined over union of subspaces of differing di- 
mensionality. For this purpose, a variable-dimensional parametric 
model has been designed to approximate the posterior of interest. 
The parameters of the approximate model have been estimated by 
means of a SEM-type algorithm, using samples from the posterior / 
generated by an RJ-MCMC algorithm. Modifications of our initial 



SEM-type algorithm have been proposed, in order to cope with the 
lack of robustness of maximum likelihood-type estimates. The rel- 
evance of the proposed algorithm, both for summarizing and for re- 
labeling variable-dimensional posterior distributions, has been illus- 
trated on the problem of detecting and estimating sinusoids in Gaus- 
sian white noise. 

We believe that this algorithm can be used in the vast domain 
of signal decomposition and mixture model analysis to enhance in- 
ference in trans-dimensional problems. For this purpose, generaliz- 
ing the proposed algorithm to the multivariate case and analyzing its 
convergence properties is considered as future work. Another impor- 
tant point would be to use a more reliable initialization procedure. 
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